This is a multivariate dataset whose variables help in understanding the factors that result in a higher quality of wine.
It’s a tidy dataset, and all the variables are numeric. It contains a small number of variables that will help us determine why a specific wine variant receives a particular rating.
The main feature of interest is Quality. Other features that will help support the investigation into the feature of interest are:
We’ll start with univariate analysis, i.e. analyzing the variables individually. This will be done with histograms & box plots, which show the type of distribution each variable has and the number of outliers (if any).
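The report itself shows only output, not code, so the following is a minimal Python sketch (standard library only) of the two numbers each univariate section relies on: a six-value summary and the points a box plot would flag as outliers. The function names and the sample values are hypothetical. The `inclusive` quantile method is used on the assumption that it matches the interpolation behind the summaries printed in this report, and the 1.5 × IQR fence is the usual box-plot convention rather than something the report states.

```python
import statistics


def summarize(values):
    """Min, quartiles, mean and max of a numeric sample."""
    data = sorted(values)
    # "inclusive" interpolates linearly between the observed data points.
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return {
        "min": data[0],
        "q1": q1,
        "median": median,
        "mean": statistics.fmean(data),
        "q3": q3,
        "max": data[-1],
    }


def boxplot_outliers(values, k=1.5):
    """Values lying more than k * IQR outside the quartiles."""
    s = summarize(values)
    iqr = s["q3"] - s["q1"]
    lower, upper = s["q1"] - k * iqr, s["q3"] + k * iqr
    return [v for v in values if v < lower or v > upper]


# Hypothetical pH-like sample, not taken from the dataset.
sample = [3.0, 3.1, 3.1, 3.2, 3.2, 3.2, 3.3, 3.3, 3.4, 3.9]
print(summarize(sample))
print(boxplot_outliers(sample))
```

Counting the flagged values per variable gives a rough, comparable measure of what the text calls "very few" or "a lot of" outliers.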
A new variable “COLOR” is created to make the box plots easier to draw.
Quality has an approximately normal, positively skewed distribution. From the histogram we can see that most of the samples have quality 5, 6 or 7. The boxplot indicates that there are very few outliers present.
Summary Statistics for Quality:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.000 5.000 6.000 5.878 6.000 9.000
Fixed acidity has an approximately normal distribution. From the histogram we can see that most of the samples have a fixed acidity level between 6.25 and 7.25. Citric acid is a part of fixed acidity. The boxplot indicates that there are a lot of outliers in the dataset.
Summary Statistics for Fixed Acidity:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.800 6.300 6.800 6.855 7.300 14.200
Volatile acidity has an approximately normal, positively skewed distribution. From the histogram we can see that most of the samples have a volatile acidity level between 0.21 and 0.32. The boxplot indicates that there are a lot of outliers in the dataset.
Summary Statistics for Volatile Acidity:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0800 0.2100 0.2600 0.2782 0.3200 1.1000
pH has an approximately normal distribution. From the histogram we can see that most of the samples have a pH level between 3.1 and 3.3. The boxplot indicates that there are a lot of outliers in the dataset.
Summary Statistics for pH:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.720 3.090 3.180 3.188 3.280 3.820
Residual sugar has a strongly positively skewed distribution. The histogram peaks at a residual sugar level between 1.7 and 2.5, while the summary below shows that the middle half of the samples lies between 1.7 and 9.9. The boxplot indicates that there are a lot of outliers in the dataset.
Summary Statistics for Residual Sugar:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.600 1.700 5.200 6.391 9.900 65.800
## Warning: Removed 3 rows containing non-finite values (stat_bin).
Alcohol has an approximately normal, positively skewed distribution. From the histogram we can see that most of the samples have an alcohol content between roughly 9.5 and 11.4 (the quartiles in the summary below). The boxplot indicates that there are very few outliers present.
Summary Statistics for Alcohol:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.00 9.50 10.40 10.51 11.40 14.20
Density has an approximately normal, positively skewed distribution. From the histogram we can see that most of the samples lie between 0.990 & 0.998. The boxplot indicates that there are very few outliers present.
Summary Statistics for Density (note: the values printed below match the Quality summary, not Density):
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.000 5.000 6.000 5.878 6.000 9.000
Chloride has an approximately normal, positively skewed distribution. From the histogram we can see that most of the samples have a chloride level between 0.03 and 0.05, consistent with the quartiles in the summary below. The boxplot indicates that there are a very large number of outliers present.
Summary Statistics for Chloride:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00900 0.03600 0.04300 0.04577 0.05000 0.34600
## Warning: Removed 17 rows containing non-finite values (stat_bin).
Sulphates has an approximately normal, positively skewed distribution. From the histogram we can see that most of the samples have a sulphate level between 0.40 and 0.55. The boxplot indicates that there are a huge number of outliers present.
Summary Statistics for Sulphates:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.2200 0.4100 0.4700 0.4898 0.5500 1.0800
All the variables are roughly bell-shaped, and most are positively skewed. There are a lot of outliers, which will be ignored in the later investigations.
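The report judges skew visually from the histograms. As a hedged complement, here is a small Python sketch of the moment-based skewness coefficient (third central moment divided by the 1.5 power of the second), where a value above zero indicates a longer right tail. The function name and both samples are hypothetical and not taken from the dataset.

```python
def skewness(values):
    """Moment-based sample skewness: m3 / m2 ** 1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # second central moment
    m3 = sum((v - mean) ** 3 for v in values) / n  # third central moment
    return m3 / m2 ** 1.5


# Hypothetical samples: one symmetric, one with a long right tail.
symmetric = [1, 2, 3, 4, 5]
right_tailed = [1, 2, 2, 3, 3, 3, 4, 10]

print(skewness(symmetric))
print(skewness(right_tailed))
```

A quicker check that needs only the printed summaries is to compare mean and median: a mean noticeably above the median, as for Residual Sugar (6.391 vs 5.200), points to a positive skew.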
Before starting the bivariate analysis, we need to understand the correlation between the variables.
We’ll subset the data by removing the columns “X” & “color”, as they are irrelevant for calculating correlation.
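The subsetting and correlation step can be sketched as follows in plain Python. This is an illustration under assumptions, not the code behind the report: the helper names, the column names `fixed.acidity` and `pH`, and the four rows are all hypothetical. Only the dropped columns “X” and “color” come from the text, and Pearson's coefficient is assumed as the correlation measure.

```python
import math


def drop_columns(rows, names):
    """Return copies of the rows without the given column names."""
    return [{k: v for k, v in row.items() if k not in names} for row in rows]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def correlation_matrix(rows):
    """Pairwise Pearson correlations keyed by (column, column)."""
    cols = list(rows[0])
    return {
        (a, b): pearson([r[a] for r in rows], [r[b] for r in rows])
        for a in cols
        for b in cols
    }


# Hypothetical rows; "X" is a row index and "color" a label, so both are dropped.
rows = [
    {"X": 1, "fixed.acidity": 7.0, "pH": 3.00, "color": "white"},
    {"X": 2, "fixed.acidity": 6.3, "pH": 3.30, "color": "white"},
    {"X": 3, "fixed.acidity": 8.1, "pH": 2.90, "color": "white"},
    {"X": 4, "fixed.acidity": 7.2, "pH": 3.10, "color": "white"},
]
numeric = drop_columns(rows, {"X", "color"})
corr = correlation_matrix(numeric)
print(round(corr[("fixed.acidity", "pH")], 2))
```

Dropping the index and the label first matters because a row counter would produce meaningless coefficients and a text column cannot enter the calculation at all.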
Positive Correlations
Negative Correlations
Fixed acidity will always have a negative correlation with pH, because the higher the acidity, the lower the pH, and vice versa.
Two components which have a strong correlation with the other variables are:
Hence, these 2 variables have a much greater capability to affect wine quality than the others. We still need to check how the above relations affect Quality.
Considering the above correlations, let’s take a closer look at the scatter plots to better understand how the variables affect the quality.
There is a strong positive correlation between Alcohol & Quality. Most of the wine samples belong to Quality 5, 6 & 7. For Quality = 5, most of the samples have lower alcohol content. For Quality = 6, the alcohol content is spread evenly, i.e. both high & low alcohol wines of this quality can be found. For Quality = 7, most of the samples have higher alcohol content.
Higher residual sugar will cause a higher density. The plot below compares how this affects quality.
Higher quality wine, which generally has a higher alcohol content, has lower residual sugar, and hence its density is lower.
The use of sulfur dioxide (SO2) is widely accepted as a useful winemaking aid. It is used as a preservative because of its anti-oxidative and anti-microbial properties in wine, but also as a cleaning agent for barrels and winery facilities. SO2 is also a by-product of fermentation. It has a positive correlation with density, which means that wines with more sulfur dioxide tend to have a higher density.
Its effect on quality is still to be determined.
High quality wine (i.e. above 6) has lower Total Sulfur Dioxide content, lower Residual Sugar and lower Density.
Fixed Acidity & pH have a strong negative correlation. Citric acid has a strong positive correlation with fixed acidity & a strong negative correlation with pH.
Acidity & pH are inversely related, hence the negative correlation. Citric acid is counted as part of fixed acidity. Acid is used in wine as a preservative. From the above plots we can’t tell how the acidity or pH level affects the quality of wine.
As this dataset covers white wine, which generally has a high acid content, no major distinguishing features are noticed here. A general trend is that higher quality wine has a lower acid content.
Higher sulfur dioxide content goes with lower alcohol content, which in turn goes with lower wine quality. High quality wines are those with a higher alcohol content (above 11) and a total sulfur dioxide content below 175.
A high level of residual sugar increases density, which is associated with lower quality, as seen before. From the above scatter plot it is seen that wines of quality 6, 7, 8 & 9 have higher alcohol content and lower density.
As we’ve seen before, higher residual sugar results in higher density. From the above plot it’s clear that a wine with higher alcohol content and lower residual sugar, and therefore lower density, tends to have a higher quality score.
Wine with lower chloride content has higher alcohol content. Higher quality wine has lower chloride content, i.e. it’s less salty. If a wine is salty, it is more likely to be of low quality.
Another interesting relationship is how sulfur dioxide content affects density & alcohol, which in turn become major factors in determining the quality.
Strongest relationships are found in:
From the above bivariate analysis it’s clear that density and alcohol are the prime factors that determine white wine quality. These 2 variables are highly affected by Residual Sugar, Total Sulfur Dioxide & Chloride content.
# Final Plots and Summary
The above factors will be analyzed with the help of scatter plots. There are a significant number of outliers, so only the data within the IQR are considered.
To make the plots easier to read, 2 new sub-variables are created from alcohol & density by slicing them into intervals.
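Slicing a continuous variable into intervals can be sketched in Python as below. The report does not state which break points were used, so the `alcohol_breaks` values are purely hypothetical, as is the helper name `cut`. Left-open, right-closed intervals are an assumption as well.

```python
from bisect import bisect_left


def cut(value, breaks):
    """Label the left-open, right-closed interval that contains value.

    Returns None when value falls outside (breaks[0], breaks[-1]].
    """
    if value <= breaks[0] or value > breaks[-1]:
        return None
    i = bisect_left(breaks, value)  # first index with breaks[i] >= value
    return f"({breaks[i - 1]}, {breaks[i]}]"


# Hypothetical break points covering the alcohol range 8.0 to 14.2.
alcohol_breaks = [7, 9, 11, 13, 15]

for alcohol in (8.0, 9.0, 10.4, 14.2):
    print(alcohol, cut(alcohol, alcohol_breaks))
```

Because the intervals are right-closed, a value sitting exactly on a break point falls into the lower slice, so the first break must lie below the smallest observed value for every sample to receive a label.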
Wines of quality 3, 4 & 5 have higher density because of higher residual sugar. Their alcohol content is mostly between 8 and 10.
Wines of quality 6, 7 & 8 have lower density, generally approx. 0.988 to 0.994, with residual sugar less than 10. Their average alcohol content is between 10 & 12.
Wines of quality 3, 4 & 5 have higher density, with total sulfur dioxide between 150 and 200. Their alcohol content is mostly below 10.
Wines of quality 6, 7 & 8 have lower density, generally approx. 0.988 to 0.994, with total sulfur dioxide content between 75 & 150. Their average alcohol content is above 11.
From the above plot it’s clear that chloride content is mostly uniform across all quality levels.
Analyzing the data, we can come to the following conclusions:
Sugar, Density & Alcohol are the major factors behind the quality grade given to a particular variety of wine. A wine with low sugar content, low density & high alcohol content is more likely to be rated as higher quality.
It’s also noticed that a wine of high quality (i.e. >= 6) has a lower Sulfur Dioxide content. The most frequent quality level of white wine is 6.
Other variables such as year of production, grape type, wine brand, type of oak used for the barrel etc. are not considered here. If these variables were considered, we might get some more insights.